{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 一.选择模型\n",
    "模型的选择也要基于业务需求来考虑，通常需要从模型的【精度、业务可解释性、系统上线难易度】等多个指标来做权衡，目前对于表格型数据，通常采用GBDT一类的模型就能取得，不错的效果，通常会用xgboost或者lightgbm，这里我们就基于lightgbm来做建模，关于它的原理，不在这里介绍，大家可以查看：https://github.com/zhulei227/ML_Notes 这个项目中的笔记"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 二.选择：损失函数 vs 评估指标\n",
    "\n",
    "#### 2.1   什么是损失函数\n",
    "\n",
    "损失函数，它是我们构建模型的预测$f(x)$与真实值$y$之间的一个函数($g_1$)，这个指标将会用于指导模型的训练：    \n",
    "\n",
    "$$\n",
    "loss(x,y,f)=g_1(f(x),y)\n",
    "$$   \n",
    "\n",
    "这里为了求解方便，通常要求$g_1$具有良好性质，通常至少是要可导的，对于lightgbm甚至要求$g_1$可二阶导，比如对于mse损失函数，对于$g_1$的定义：   \n",
    "\n",
    "$$\n",
    "g_1(f(x),y)=(f(x)-y)^2\n",
    "$$   \n",
    "\n",
    "而对于评估指标，它本质也是关于$f,x,y$的一个函数($g_2$)：   \n",
    "\n",
    "$$\n",
    "eval(x,y,f)=g_2(f(x),y)\n",
    "$$    \n",
    "\n",
    "而对于$g_2$就没有那么好的性质类，对于这样的评估指标：\"前top 5%的平均赔付率\",我们需要先对$f(x)$排序，去前top 5%的$y$值求平均，这样的评估指标压根不可导，甚至没法用数学表达式，所以通常会选择一个可导的$g_1$去替代$g_2$，期望通过对$g_1$的最优化来对$g_2$进行最优化，比如下面，通过最小化mse去，最大化评估指标：测试集前top 5%的平均赔付率   \n",
    "\n",
    "|||\n",
    "|---|---|\n",
    "|评估指标|测试集前top 5%的平均赔付率|\n",
    "|损失函数|mse|\n",
    "\n",
    "\n",
    "#### 2.2 如何选择最佳的损失函数\n",
    "从上面的介绍可以知道，损失函数与我们的评估指标之间存在一个差距，那么如何找到（创造）一个很好的损失函数，使得它与评估指标之间的gap尽可能地小？这个问题很麻烦，通常来说可以尝试如下的方法：    \n",
    "\n",
    "1）尝试最常用的一些损失函数，比如回归用mse,mae,分类用交叉熵；    \n",
    "\n",
    "2）对有很强分布假设的数据采用对应的损失函数，比如出险次数的分布通常符合poisson分布，损失函数可以尝试使用poisson回归的损失函数；   \n",
    "\n",
    "3）使用多种损失函数的组合，这通常需要自己去自定义实现损失函数；    \n",
    "\n",
    "下面就用lgm训练一个模型，从mse开始"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import lightgbm as lgb\n",
    "from sklearn.metrics import mean_absolute_error, mean_squared_error, auc\n",
    "from pre_process.xxx import *\n",
    "import toad\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "diretory=\"xxx\"\n",
    "test0 = pd.read_csv(diretory+'xxx.csv', low_memory=False)\n",
    "test0.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1加工目标:出险率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "claim = pd.read_csv(diretory+'/data/claimnum.csv')\n",
    "claim = claim[['polnum_com', 'num']]\n",
    "claim = claim.rename(columns={'polnum_com': 'policy_no'})\n",
    "test0=pd.merge(test0,claim,on=\"policy_no\",how=\"inner\")\n",
    "test0['y'] = np.where(test0['num']>0, 1, 0)\n",
    "test0=test0.drop(\"num\",axis=1)\n",
    "#平均出险率\n",
    "test0['y'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2 数据清洗+特征工程\n",
    "\n",
    "这里的特征工程是对全局数据进行操作，需要避免数据泄露的操作"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "dat1=test0.copy()\n",
    "#因子筛选\n",
    "dat1 = ppcess_filterCols(dat1, cols='join')\n",
    "#去掉饱和度过低的因子 by 80%\n",
    "na_cols = []\n",
    "for c in dat1.columns:\n",
    "    if dat1[c].isnull().sum()/len(dat1)>0.8:\n",
    "        na_cols.append(c)\n",
    "dat1 = dat1.drop(na_cols, axis=1)\n",
    "# 去掉集团因子里面产险数据来源的因子\n",
    "jt_cx_cols = ['xxx']\n",
    "dat1 = dat1.drop(jt_cx_cols, axis=1)\n",
    "dat1 = ppcess_dateCols(dat1,print_process_cols=False) # 日期因子格式\n",
    "dat1 = ppcess_toNum(dat1,print_process_cols=False) # 部分因子转换为数值型\n",
    "dat1 = ppcess_delCols_na(dat1,print_process_cols=False) # 删去饱和度低于1%的因子\n",
    "dat1 = ppcess_inpute(dat1,print_process_cols=False) # 部分因子按规则进行空值填充，数值型因子在后面做均值填充\n",
    "dat1 = ppcess_delCols_uniValue(dat1,print_process_cols=False) # 删去全部唯一值的因子\n",
    "\n",
    "dat2 = dat1.copy()\n",
    "dat2 = feature_catValues(dat2,print_process_cols=False) # 修正部分因子取值\n",
    "dat2 = feature_factorLabel1(dat2,print_process_cols=False) # 离散特征编码1\n",
    "dat2 = feature_factorLabel2(dat2,print_process_cols=False) # 离散特征编码2\n",
    "dat2 = feature_lambda(dat2,print_process_cols=False) # 构建一些特征\n",
    "dat2 = dat2.drop(['xxx'], axis=1)\n",
    "dat2['xxx'] = dat2['xxx'].fillna(0)\n",
    "dat2['y']=test0['y']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.3 切分训练集，验证集，测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "trn_df=dat2[test0['effdate'].apply(lambda x:x[:7]<=\"2019-10\")]\n",
    "\n",
    "dev_test_df=dat2[test0['effdate'].apply(lambda x:x[:7]>=\"2019-11\")]\n",
    "indice=list(range(0,dev_test_df.shape[0]))\n",
    "np.random.shuffle(indice)\n",
    "dev_test_df=dev_test_df.iloc[indice]\n",
    "dev_df=dev_test_df.iloc[:dev_test_df.shape[0]//2]\n",
    "test_df=dev_test_df.iloc[dev_test_df.shape[0]//2:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "trn_df.shape,dev_df.shape,test_df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "trn_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.4 target encoding\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "object_cols=trn_df.dtypes[trn_df.dtypes==object].reset_index()['index'].tolist()\n",
    "trn_df[object_cols]=trn_df[object_cols].fillna(\"missing\")\n",
    "dev_df[object_cols]=dev_df[object_cols].fillna(\"missing\")\n",
    "test_df[object_cols]=test_df[object_cols].fillna(\"missing\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "object_target_cols=[]\n",
    "for col in object_cols:\n",
    "    object_target_cols.append(col+\"_target\")\n",
    "    trn_df[col+\"_target\"]=trn_df[col]\n",
    "    dev_df[col+\"_target\"]=dev_df[col]\n",
    "    test_df[col+\"_target\"]=test_df[col]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import category_encoders as ce\n",
    "le=ce.TargetEncoder(cols=object_target_cols)\n",
    "le.fit(trn_df,trn_df['y'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "trn_df=le.transform(trn_df)\n",
    "dev_df=le.transform(dev_df)\n",
    "test_df=le.transform(test_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  3.5 ordinary encoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "oe=ce.OrdinalEncoder()\n",
    "oe.fit(trn_df,cols=object_cols)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "trn_df=oe.transform(trn_df)\n",
    "dev_df=oe.transform(dev_df)\n",
    "test_df=oe.transform(test_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于离散型变量，往往有许多编码方式，主要分为两类：    \n",
    "1）一类是不与目标y交互，比如one-hot编码，ordinary encoding...    \n",
    "2）另一类会与目标y交互，生成一些统计量，比如上面的target encoding,catboost...   \n",
    "\n",
    "通常与y交互会增加当前特征的信息量，比如下面从相关性的角度去做一个对比"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corr_df=trn_df[object_cols+object_target_cols+['y']].corr().reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ordinary_corr=[]\n",
    "target_encode_corr=[]\n",
    "for col in object_cols:\n",
    "    ordinary_corr.append(corr_df[corr_df['index']==col]['y'].tolist()[0]) \n",
    "    target_encode_corr.append(corr_df[corr_df['index']==col+\"_target\"]['y'].tolist()[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corr_df=pd.DataFrame({\"col_name\":object_cols,\"ordinary_corr\":ordinary_corr,\"target_encode_corr\":target_encode_corr})\n",
    "corr_df[\"相对提升\"]=corr_df['target_encode_corr'].abs()/corr_df['ordinary_corr'].abs()-1.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "corr_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 四.模型训练"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#自定义metrics\n",
    "def eval_function(y_true,y_pred):\n",
    "    try:\n",
    "        y_pred=y_pred.get_label()\n",
    "    except:\n",
    "        pass\n",
    "    sort_indice=np.argsort(y_pred)[::-1]\n",
    "    metric_value=y_true[sort_indice[:int(0.05*len(y_true))]].mean()\n",
    "    return \"eval_function\",metric_value,True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def eval(y_true,y_pred):\n",
    "    return np.round(eval_function(y_true,y_pred)[1]/eval_function(y_true,y_true)[1],2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 超参\n",
    "params = {\n",
    "'boosting_type':'gbdt',#集成方式，还包括rf,dart,goss等\n",
    "'objective':'binary',\n",
    "'metric':\"eval_function\",\n",
    "'learning_rate':0.1,#学习率\n",
    "'max_depth':3,#单颗树的最大深度\n",
    "'num_leaves':20,#叶子节点数\n",
    "'max_bins':32,#分箱数\n",
    "'lambda_l1':0.1,#l1正则化权重\n",
    "'lambda_l2':0.1,#l2正则化权重\n",
    "'min_data_in_leaf':20,#叶子节点最小记录数\n",
    "'feature_fraction':0.85,#特征抽取比例\n",
    "}\n",
    "\n",
    "trn_x = trn_df.drop(['policy_no','y'], axis=1)\n",
    "trn_y = trn_df['y']\n",
    "\n",
    "val_x = dev_df.drop(['policy_no','y'], axis=1)\n",
    "val_y = dev_df['y']\n",
    "\n",
    "trn_data = lgb.Dataset(trn_x, trn_y)\n",
    "val_data = lgb.Dataset(val_x, val_y)\n",
    "\n",
    "reg = lgb.train(params = params,\n",
    "                train_set = trn_data,\n",
    "                num_boost_round = 200,#最大树数量\n",
    "                early_stopping_rounds = 20,#如何验证集效果在20轮中没有明显变好，就终止\n",
    "                feval=eval_function,\n",
    "                valid_sets = [val_data])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "训练集预测结果 0.51\n",
      "验证集预测结果 0.3\n",
      "测试集预测结果 0.32\n"
     ]
    }
   ],
   "source": [
    "test_x = test_df.drop(['policy_no','y'], axis=1)\n",
    "test_y = test_df['y']\n",
    "print(\"训练集预测结果\",eval(trn_y.values,reg.predict(trn_x)))\n",
    "print(\"验证集预测结果\",eval(val_y.values,reg.predict(val_x)))\n",
    "print(\"测试集预测结果\",eval(test_y.values,reg.predict(test_x)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据预测我们该接下来该如何调参数？可能会有如下策略    \n",
    "1）调整树的深度   \n",
    "2）调整树的数量    \n",
    "3）调整叶子节点数量    \n",
    "4）修改目标函数   \n",
    "5）修改集成方式    \n",
    "6）调整学习率     \n",
    "7）添加训练数据    \n",
    "8）调整正则化权重  \n",
    "9）对坏样本进行归类分析  \n",
    "10）添加特征工程  \n",
    ".......\n",
    "这些策略，我们该如何选择？可以采用如下步骤：   \n",
    "1）找问题：首先，我们需要找到影响目前模型性能的主要问题；    \n",
    "2）选方法：然后选择具有针对性的方法去优化，这里往往也需要权衡选择，比如某些方法可能会有较高的收益，但是会耗费很多时间，所以这里需要一个预判，哪些方法现对花费时间少，而收益更高；     \n",
    "3）做实验：组织好代码结构，验证方法的收益；   \n",
    "4）重复1~3，循环下去  \n",
    "\n",
    "这部分的内容放到后面“迭代”这一部分讲解，下面聊一下借助于lgb，我们还可以做一些有助于后续模型优化的分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 五.模型分析   \n",
    "\n",
    "#### 5.1 全局：特征重要度  \n",
    "决策树模型通常可以帮助我们判断特征的重要度，往往有如下的几种评估方式：    \n",
    "1）importance_type=split（默认值）:特征被切分过的次数；    \n",
    "2）importance_type=gain:特征被切分时的loss下降值；  \n",
    "3）importance_type=cover:特征被切分时的样本量(对于mse损失函数)；   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# importance_df=pd.DataFrame({\"imp\":reg.feature_importance(importance_type=\"split\"),\"col\":trn_x.columns})\n",
    "# importance_df.sort_values(\"imp\",ascending=False)[:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# importance_df=pd.DataFrame({\"imp\":reg.feature_importance(importance_type=\"gain\"),\"col\":trn_x.columns})\n",
    "# importance_df.sort_values(\"imp\",ascending=False)[:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.2 局部：贡献归因    \n",
    "计算每个特征对于预测值的贡献多少，最常用的是shaply，它具有很好的性质，下面是预测的训练集的第一条数据，可以发现每个因子贡献再加上期望值就等于模型的预测值(不过要加上最后的激活函数)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# contrib_df=pd.DataFrame({\"contrib\":reg.predict(trn_x.loc[0],pred_contrib=True)[0].tolist(),\"col\":trn_x.columns.tolist()+['_bias']})\n",
    "# contrib_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.16973732])"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "reg.predict(trn_x.loc[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.16973732195200125"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "1.0/(1.0+np.exp(-1*contrib_df['contrib'].sum()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5.3 多因子分析    \n",
    "决策树本质将整个样本集通过特征划分为许许多多的小区域，每一个叶子节点就代表了一个小区域，了解这些小区域内目标均值等统计量，可以帮助我们做进一步的分析或者特征工程"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#读取权重信息\n",
    "#树id,叶子节点id,叶子节点路径,叶子节点cover,叶子节点取值\n",
    "import pandas as pd\n",
    "class TreeLeafPath(object):\n",
    "    def __init__(self,feature_names):\n",
    "        self.info=[]\n",
    "        self.feature_names=feature_names\n",
    "        \n",
    "    def CalInfo(self,total_tree):\n",
    "        def G(tree,tree_index,path):\n",
    "            if tree.get('leaf_value') is not None:\n",
    "                # 叶子节点\n",
    "                self.info.append([tree_index,tree.get('leaf_index'),path,tree.get(\"leaf_count\"),tree.get(\"leaf_value\")])\n",
    "            else:\n",
    "                # 读取路径信息\n",
    "                split_feature_id=tree['split_feature']\n",
    "                try:\n",
    "                    threshold=np.round(float(tree['threshold']),2)\n",
    "                except:\n",
    "                    threshold=tree['threshold']\n",
    "                if path==\"\":\n",
    "                    G(tree['left_child'],tree_index,path+self.feature_names[split_feature_id]+\"<=\"+str(threshold))\n",
    "                    G(tree['right_child'],tree_index,path+self.feature_names[split_feature_id]+\">\"+str(threshold))\n",
    "                else:\n",
    "                    G(tree['left_child'],tree_index,path+\"&\"+self.feature_names[split_feature_id]+\"<=\"+str(threshold))\n",
    "                    G(tree['right_child'],tree_index,path+\"&\"+self.feature_names[split_feature_id]+\">\"+str(threshold))\n",
    "                \n",
    "        for item in total_tree['tree_info']:\n",
    "            tree = item['tree_structure']\n",
    "            G(tree,item['tree_index'],\"\")\n",
    "        rst=pd.DataFrame(self.info,columns=[\"tree_index\",\"leaf_index\",\"path\",\"leaf_count\",\"leaf_value\"])\n",
    "        rst=rst.sort_values(['tree_index','leaf_index'])\n",
    "        return rst"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tlp=TreeLeafPath(feature_names=trn_df.columns.tolist())\n",
    "tree_info=tlp.CalInfo(reg.dump_model())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tree_info['predict_value']=tree_info['leaf_value'].apply(lambda x:1.0/(1.0+np.exp(-1*x)))#sigmoid"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# tree_info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
